58 research outputs found

    Precise event sampling on AMD versus Intel: quantitative and qualitative comparison

    Precise event sampling is a profiling feature in commodity processors that can sample hardware events and accurately locate the instructions that trigger the events. This feature has been used in a large number of tools to detect application performance issues. Although precise event sampling is readily supported in modern multicore architectures, vendor implementations differ greatly in accuracy, stability, overhead, and functionality. This work presents the most comprehensive study to date on benchmarking the event sampling features of Intel PEBS and AMD IBS and performs an in-depth analysis of key differences through a series of microbenchmarks. Our qualitative and quantitative analysis shows that PEBS allows finer-grained and more accurate sampling of hardware events, while IBS offers a richer set of information at each sample, though it suffers from lower accuracy and stability. Moreover, OS signal delivery, which is a common method used by profiling software, adds significant time overhead on top of the overhead incurred by the hardware mechanisms in both PEBS and IBS. We also found that both PEBS and IBS are biased when sampling events across multiple locations in the code. Lastly, we demonstrate how our findings on microbenchmarks under different thread counts hold for a full-fledged profiling tool that runs on state-of-the-art Intel and AMD machines. Overall, our detailed comparisons serve as a great reference and provide invaluable information for hardware designers and profiling tool developers.

    Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels

    Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern architectures is close to impossible. Analytic loop performance modeling is a useful way to understand the relevant bottlenecks of code execution based on simple machine models. The Roofline model and the Execution-Cache-Memory (ECM) model are proven approaches to performance modeling of loop nests. In comparison to the Roofline model, the ECM model can also describe the single-core performance and saturation behavior on a multicore chip. We give an introduction to the Roofline and ECM models, and to stencil performance modeling using layer conditions (LC). We then present Kerncraft, a tool that can automatically construct Roofline and ECM models for loop nests by performing the required code, data transfer, and LC analysis. The layer condition analysis makes it possible to predict optimal spatial blocking factors for loop nests. Together with the models, it enables an ab-initio estimate of the potential benefits of loop blocking optimizations and of useful block sizes. In cases where LC analysis is not easily possible, Kerncraft supports a cache simulator as a fallback option. Using a 25-point long-range stencil, we demonstrate the usefulness and predictive power of the Kerncraft tool. Comment: 22 pages, 5 figures
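
    As a rough illustration of the Roofline model mentioned in this abstract, the following minimal sketch computes the textbook bound: attainable performance is the smaller of peak compute throughput and memory bandwidth times arithmetic intensity. The function name and the machine numbers are hypothetical, and this is not Kerncraft's interface or the ECM model.

```python
def roofline_gflops(peak_gflops: float,
                    mem_bw_gbs: float,
                    intensity_flop_per_byte: float) -> float:
    """Textbook Roofline bound in GFLOP/s.

    peak_gflops: peak compute throughput of the machine (assumed value).
    mem_bw_gbs: sustained memory bandwidth in GB/s (assumed value).
    intensity_flop_per_byte: arithmetic intensity of the loop kernel.
    """
    return min(peak_gflops, mem_bw_gbs * intensity_flop_per_byte)


def ridge_point(peak_gflops: float, mem_bw_gbs: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory bound."""
    return peak_gflops / mem_bw_gbs


if __name__ == "__main__":
    # Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
    peak, bw = 100.0, 50.0
    for intensity in (0.5, 2.0, 4.0):
        bound = roofline_gflops(peak, bw, intensity)
        kind = "memory" if intensity < ridge_point(peak, bw) else "compute"
        print(f"I = {intensity} flop/byte -> {bound} GFLOP/s ({kind} bound)")
```

    A kernel whose intensity lies below the ridge point is limited by data transfer, which is the case where blocking optimizations of the kind the abstract describes can pay off.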

    Trends in Data Locality Abstractions for HPC Systems

    The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.

    Compiler-Driven Data Layout Transformation for Heterogeneous Platforms


    An optimized implementation of 3D seismic wave simulation with GPUs


    Precise event sampling-based data locality tools for AMD multicore architectures

    We propose COMDETECTIVE+, an inter-thread communication analyzer, and REUSETRACKER+, a reuse distance analyzer, that leverage the hardware features in AMD processors to support low-overhead profiling. Both tools employ the instruction-based sampling (IBS) facility and debug registers in AMD processors to detect inter-thread communication and data reuse. Unlike prior art, COMDETECTIVE+ differentiates the communication into true and false sharing, and REUSETRACKER+ measures reuse distance in private and shared caches with low overhead while also accounting for cache line invalidation. Both tools can attribute the communications and reuses to source code lines. To our knowledge, these tools are two of the few profiling tools designed specifically for AMD x86 architectures using IBS. Our tools are timely and relevant considering the growing number of data centers and HPC systems based on AMD processors. We perform experiments to evaluate the accuracy and overheads of the proposed tools on a two-socket AMD machine with EPYC 7352 processors. COMDETECTIVE+ exhibits high accuracy while introducing 5.14× runtime and 1.4× memory overheads. REUSETRACKER+ also displays high accuracy (95%), with 11.76× runtime and 1.46× memory overheads. These overheads are much lower than the overheads of existing simulators and code instrumentation-based tools. Lastly, we demonstrate the usage of the tools by having COMDETECTIVE+ and REUSETRACKER+ facilitate the code refactoring of two data mining benchmarks to improve their performance by up to 29%.
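
    To make the notion of reuse distance in this abstract concrete, here is a minimal sketch of its usual definition: the number of distinct cache lines touched between two consecutive accesses to the same line. It processes a complete address trace in a naive quadratic way and is not the sampling-based method of REUSETRACKER+; the function name and the 64-byte line size are assumptions made for illustration.

```python
from typing import Dict, List, Optional

LINE_SIZE = 64  # assumed cache-line size in bytes


def reuse_distances(addresses: List[int],
                    line_size: int = LINE_SIZE) -> List[Optional[int]]:
    """Return the reuse distance of every access in a trace.

    The entry is None for the first access to a cache line; otherwise it
    is the count of distinct other lines accessed since the previous
    access to the same line.
    """
    lines = [addr // line_size for addr in addresses]
    last_seen: Dict[int, int] = {}  # cache line -> index of latest access
    result: List[Optional[int]] = []
    for i, line in enumerate(lines):
        if line in last_seen:
            # Lines accessed strictly between the previous use and this one.
            between = set(lines[last_seen[line] + 1:i])
            result.append(len(between))
        else:
            result.append(None)
        last_seen[line] = i
    return result


if __name__ == "__main__":
    # Byte addresses; 0 and 8 fall into the same 64-byte line.
    trace = [0, 64, 128, 0, 64, 8]
    print(reuse_distances(trace))
```

    Small reuse distances suggest that the data is likely still cached when it is touched again, which is why the metric is used to reason about data locality.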